OpenAI - tiktoken ⏳ | fast BPE tokeniser

sockstack / 284 / 2024-02-27 13:02:38

ChatGPT 可用网址，仅供交流学习使用，如对您有所帮助，请收藏并推荐给需要的朋友。 <a href="https://ckai.xyz/?sockstack&section=detail" target="__blank">https://ckai.xyz</a> <article class="baidu_pl"><div id="article_content" class="article_content clearfix">
<link rel="stylesheet" href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/kdoc_html_views-1a98987dfd.css">
<link rel="stylesheet" href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/ck_htmledit_views-25cebea3f9.css">
<div id="content_views" class="markdown_views prism-atom-one-light">
<svg xmlns="http://www.w3.org/2000/svg" style="display: none;"><path stroke-linecap="round" d="M5,0 0,2.5 5,5z" id="raphael-marker-block" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);"></path></svg> 
<div class="toc"> <h3>文章目录</h3> <ul><li>
<ul>
<li>关于 ⏳ tiktoken</li>
<li>
<ul>
<li>性能表现</li>
<li>安装</li>
<li>tiktoken 如何计算 token</li>
<li>Encodings</li>
<li>Tokenizer libraries 对不同编程语言的支持</li>
<li>How strings are typically tokenized</li>
</ul> </li>
<li>使用</li>
<li>
<ul>
<li>编解码</li>
<li>比较 encodings</li>
<li>计算chat API调用的tokens</li>
<li>拓展 tiktoken</li>
</ul> </li>
</ul> </li></ul> 
</div> 
 
<hr> 
<h2>
<a id="__tiktoken_2"></a>关于 ⏳ tiktoken</h2> 
tiktoken is a fast BPE tokeniser for use with OpenAI’s models. 初看这个名字，以为是跟 tiktok 相关，没想到是 openai 下面的，这取名还真是有趣呢。 
<ul>
<li>github https://github.com/openai/tiktoken</li>
<li>openai-cookbook / examples / How_to_count_tokens_with_tiktoken.ipynb https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb</li>
</ul> 
<hr> 
<h3>
<a id="_14"></a>性能表现</h3> 
tiktoken 比其他开源 tokeniser 快 3-6 倍 基于 1GB 文本进行测试，使用 GPT-2 tokeniser，使用 <code>GPT2TokenizerFast</code> from <code>tokenizers==0.13.2</code>, <code>transformers==4.24.0</code> and <code>tiktoken==0.2.0</code>。 
<img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/a852581e177340d389d8e0c0147a7bbb.png" alt="在这里插入图片描述"> 
<hr> 
<h3>
<a id="_22"></a>安装</h3> 
<pre><code class="prism language-shell">pip install tiktoken 
</code></pre> 
<hr> 
<h3>
<a id="tiktoken__token_30"></a>tiktoken 如何计算 token</h3> 
给定一个文本字符：<code>"tiktoken is great!"</code>，和一个 encoding，比如 <code>"cl100k_base"</code>。 一个 tokenizer 可以讲文本字符串分割成一系列 tokens，如： <code>["t", "ik", "token", " is", " great", "!"]</code> 
GPT 模型使用这种类型的 token。 知道文本字符串中有多少令牌，可以告诉你（a）字符串是否太长，文本模型无法处理，以及（b）OpenAI API调用的成本（因为使用是按令牌定价的）。 
<hr> 
<h3>
<a id="Encodings_41"></a>Encodings</h3> 
编码指定如何将文本转换为标记。不同的模型使用不同的编码。 
OpenAI models 使用 <code>tiktoken</code> 支持下面三种编码： 
<table>
<thead><tr>
<th>Encoding name</th>
<th>OpenAI models</th>
</tr></thead>
<tbody>
<tr>
<td><code>cl100k_base</code></td>
<td>
<code>gpt-4</code>, <code>gpt-3.5-turbo</code>, <code>text-embedding-ada-002</code>
</td>
</tr>
<tr>
<td><code>p50k_base</code></td>
<td>Codex models, <code>text-davinci-002</code>, <code>text-davinci-003</code>
</td>
</tr>
<tr>
<td>
<code>r50k_base</code> (or <code>gpt2</code>)</td>
<td>GPT-3 models like <code>davinci</code>
</td>
</tr>
</tbody>
</table> 
您可以获取一个模型的编码 ，使用 <code>tiktoken.encoding_for_model()</code> 如下： 
<pre><code class="prism language-python">encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
</code></pre> 
注意，<code>p50k_base</code> 与 <code>r50k_base</code> 基本类似，对于非代码应用程序，它们通常会给出相同的令牌。 
<hr> 
<h3>
<a id="Tokenizer_libraries__63"></a>Tokenizer libraries 对不同编程语言的支持</h3> 
对于 <code>cl100k_base</code> 和 <code>p50k_base</code> encodings: 
<ul>
<li>Python: tiktoken</li>
<li>.NET / C#: SharpToken</li>
</ul> 
<hr> 
对于 <code>r50k_base</code> (<code>gpt2</code>) encodings, tokenizers are available in many languages. 
<ul>
<li>Python: tiktoken (or alternatively GPT2TokenizerFast)</li>
<li>JavaScript: gpt-3-encoder</li>
<li>.NET / C#: GPT Tokenizer</li>
<li>Java: gpt2-tokenizer-java</li>
<li>PHP: GPT-3-Encoder-PHP</li>
</ul> 
（OpenAI不对第三方库进行背书或保证。） 
<hr> 
<h3>
<a id="How_strings_are_typically_tokenized_84"></a>How strings are typically tokenized</h3> 
In English, tokens commonly range in length from one character to one word (e.g., <code>"t"</code> or <code>" great"</code>), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., <code>" is"</code> instead of <code>"is "</code> or <code>" "</code>+<code>"is"</code>). You can quickly check how a string is tokenized at the OpenAI Tokenizer. 
在英语中，tokens的长度通常从一个字符到一个单词（例如，<code>t</code> 或 <code>great</code> ），尽管在一些语言中，tokens 可以短于一个字符或长于一个单词。 空格通常以单词的开头分组（例如，<code> is</code> 而不是 <code>is</code> 或 <code> </code>+ <code>is</code>。 您可以在[OpenAI Tokenizer]快速检查字符串是如何tokenize的。 
OpenAI Tokenizer : https://beta.openai.com/tokenizer 
<hr> 
<img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/e81f96dcb632484e8d1d463b4c0d5145.png" alt="在这里插入图片描述" width="500"> 
<hr> 
<h2>
<a id="_100"></a>使用</h2> 
<h3>
<a id="_102"></a>编解码</h3> 
<pre><code class="prism language-python">import tiktoken
</code></pre> 
<pre><code class="prism language-python"># 使用名字加载 encoding
# 第一次运行时，可能需要连接互联网来下载；下一次不需要联网
encoding = tiktoken.get_encoding("cl100k_base")# 对于给定的模型名，自动加载正确的 encoding 
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")# 将文本转化为 tokens 列表
encoding.encode("tiktoken is great!")
# [83, 1609, 5963, 374, 2294, 0]
</code></pre> 
<pre><code class="prism language-python"># 计算 encode 返回列表的长度
def num_tokens_from_string(string: str, encoding_name: str) -&gt; int:"""Returns the number of tokens in a text string."""encoding = tiktoken.get_encoding(encoding_name)num_tokens = len(encoding.encode(string))return num_tokens
</code></pre> 
<pre><code class="prism language-python">num_tokens_from_string("tiktoken is great!", "cl100k_base") # 6
</code></pre> 
<hr> 
<pre><code class="prism language-python"># 将 tokens 转化为 文本
encoding.decode([83, 1609, 5963, 374, 2294, 0])
# 'tiktoken is great!'
</code></pre> 
<hr> 
警告：尽管 <code>.decode()</code> 可以应用于单个令牌，但要注意，对于不在utf-8边界上的令牌，它可能会有损耗。 对于单个 tokens，<code>.decode_single_token_bytes()</code> 方法安全地将单个整数令牌转换为它所代表的字节。 
<pre><code class="prism language-python">[encoding.decode_single_token_bytes(token) for token in [83, 1609, 5963, 374, 2294, 0]]
# [b't', b'ik', b'token', b' is', b' great', b'!']
</code></pre> 
（字符串前面的 <code>b</code> 表示字符串是字节字符串。） 
<hr> 
<h3>
<a id="_encodings_161"></a>比较 encodings</h3> 
不同的编码在拆分单词、组空格和处理非英语字符的方式上各不相同。使用上面的方法，我们可以比较几个示例字符串的不同编码。 
<pre><code class="prism language-python">def compare_encodings(example_string: str) -&gt; None:"""Prints a comparison of three string encodings."""# print the example stringprint(f'\nExample string: "{example_string}"')# for each encoding, print the # of tokens, the token integers, and the token bytesfor encoding_name in ["gpt2", "p50k_base", "cl100k_base"]:encoding = tiktoken.get_encoding(encoding_name)token_integers = encoding.encode(example_string)num_tokens = len(token_integers)token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]print()print(f"{encoding_name}: {num_tokens} tokens")print(f"token integers: {token_integers}")print(f"token bytes: {token_bytes}")</code></pre> 
<hr> 
<pre><code class="prism language-python">compare_encodings("antidisestablishmentarianism")
</code></pre> 
<pre><code class="prism language-python">Example string: "antidisestablishmentarianism"gpt2: 5 tokens
token integers: [415, 29207, 44390, 3699, 1042]
token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']p50k_base: 5 tokens
token integers: [415, 29207, 44390, 3699, 1042]
token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']cl100k_base: 6 tokens
token integers: [519, 85342, 34500, 479, 8997, 2191]
token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']
</code></pre> 
<hr> 
<pre><code class="prism language-python">compare_encodings("2 + 2 = 4")
</code></pre> 
<pre><code class="prism language-python">Example string: "2 + 2 = 4"gpt2: 5 tokens
token integers: [17, 1343, 362, 796, 604]
token bytes: [b'2', b' +', b' 2', b' =', b' 4']p50k_base: 5 tokens
token integers: [17, 1343, 362, 796, 604]
token bytes: [b'2', b' +', b' 2', b' =', b' 4']cl100k_base: 7 tokens
token integers: [17, 489, 220, 17, 284, 220, 19]
token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']
</code></pre> 
<hr> 
<pre><code class="prism language-python">compare_encodings("お誕生日おめでとう")
</code></pre> 
<pre><code class="prism language-python">Example string: "お誕生日おめでとう"gpt2: 14 tokens
token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]
token bytes: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']p50k_base: 14 tokens
token integers: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]
token bytes: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']cl100k_base: 9 tokens
token integers: [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]
token bytes: [b'\xe3\x81\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97\xa5', b'\xe3\x81\x8a', b'\xe3\x82\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8\xe3\x81\x86']
</code></pre> 
<hr> 
<h3>
<a id="chat_APItokens_268"></a>计算chat API调用的tokens</h3> 
ChatGPT models like <code>gpt-3.5-turbo</code> and <code>gpt-4</code> use tokens in the same way as older completions models, but because of their message-based formatting, it’s more difficult to count how many tokens will be used by a conversation. 
Below is an example function for counting tokens for messages passed to <code>gpt-3.5-turbo-0301</code> or <code>gpt-4-0314</code>. 
Note that the exact way that tokens are counted from messages may change from model to model. Consider the counts from the function below an estimate, not a timeless guarantee. 
像 <code>gpt-3.5-turbo</code> 和 <code>gpt-4</code> 这样的ChatGPT模型使用tokens 的方式与旧的完成模型相同，但由于它们基于消息的格式，很难计算会话将使用多少tokens。 下面是一个示例函数，用于对传递到 <code>gpt-3.5-turbo-0301</code> 或 <code>gpt-4-0314</code> 的消息的tokens进行计数。 请注意，从消息中计算tokens的确切方式可能会因模型而异。将函数中的计数视为一个估计值，而不是一个永恒的保证。 
<hr> 
<pre><code class="prism language-python">def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):"""Returns the number of tokens used by a list of messages."""try:encoding = tiktoken.encoding_for_model(model)except KeyError:print("Warning: model not found. Using cl100k_base encoding.")encoding = tiktoken.get_encoding("cl100k_base")if model == "gpt-3.5-turbo":print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")elif model == "gpt-4":print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")return num_tokens_from_messages(messages, model="gpt-4-0314")elif model == "gpt-3.5-turbo-0301":tokens_per_message = 4 # every message follows &lt;|start|&gt;{role/name}\n{content}&lt;|end|&gt;\ntokens_per_name = -1 # if there's a name, the role is omittedelif model == "gpt-4-0314":tokens_per_message = 3tokens_per_name = 1else:raise NotImplementedError(f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")num_tokens = 0for message in messages:num_tokens += tokens_per_messagefor key, value in message.items():num_tokens += len(encoding.encode(value))if key == "name":num_tokens += tokens_per_namenum_tokens += 3 # every reply is primed with &lt;|start|&gt;assistant&lt;|message|&gt;return num_tokens
</code></pre> 
<hr> 
<pre><code class="prism language-python"># let's verify the function above matches the OpenAI API responseimport openaiexample_messages = [{"role": "system","content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English.",},{"role": "system","name": "example_user","content": "New synergies will help drive top-line growth.",},{"role": "system","name": "example_assistant","content": "Things working well together will increase revenue.",},{"role": "system","name": "example_user","content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",},{"role": "system","name": "example_assistant","content": "Let's talk later when we're less busy about how to do better.",},{"role": "user","content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",},
]for model in ["gpt-3.5-turbo-0301", "gpt-4-0314"]:print(model)# example token count from the function defined aboveprint(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")# example token count from the OpenAI APIresponse = openai.ChatCompletion.create(model=model,messages=example_messages,temperature=0,max_tokens=1 # we're only counting input tokens here, so let's not waste tokens on the output)print(f'{response["usage"]["prompt_tokens"]} prompt tokens counted by the OpenAI API.')print()</code></pre> 
<hr> 
<pre><code class="prism language-python">gpt-3.5-turbo-0301
127 prompt tokens counted by num_tokens_from_messages().
127 prompt tokens counted by the OpenAI API.gpt-4-0314
129 prompt tokens counted by num_tokens_from_messages().
129 prompt tokens counted by the OpenAI API.
</code></pre> 
<hr> 
<h3>
<a id="_tiktoken_388"></a>拓展 tiktoken</h3> 
您可能希望扩展 tiktoken 以支持新的编码。有两种方法可以做到这一点。 按照您想要的方式创建Encoding对象，然后简单地传递它。 
方式一： 
<pre><code class="prism language-python">cl100k_base = tiktoken.get_encoding("cl100k_base")# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(# If you're changing the set of special tokens, make sure to use a different name# It should be clear from the name what behaviour to expect.name="cl100k_im",pat_str=cl100k_base._pat_str,mergeable_ranks=cl100k_base._mergeable_ranks,special_tokens={**cl100k_base._special_tokens,"&lt;|im_start|&gt;": 100264,"&lt;|im_end|&gt;": 100265,}
)
</code></pre> 
<hr> 
方式二： 使用 tiktoken_ext 插件机制 向tiktoken注册Encoding对象。 只有当您需要 <code>tiktoken.get_encoding</code> 来查找您的编码时，这才有用，否则更适合上面方式1。 要做到这一点，您需要在 <code>tiktoken_ext</code> 下创建一个命名空间包。 这样布局你的项目，确保省略 <code>tiktoken_ext/__init__.py</code>文件： 
<pre><code class="prism language-python">my_tiktoken_extension
├── tiktoken_ext
│ └── my_encodings.py
└── setup.py
</code></pre> 
<hr> 
<code>my_encodings.py</code> 应该是一个包含名为 <code>ENCODING_CONSTRUCTORS</code> 的变量的模块。 这是一个从编码名称到函数的字典，该函数不接受参数，并返回可以传递给 tiktoken.encoding 的参数来构造该编码。 例如，请参阅 <code>tiktoken_ext/openai_public.py</code>。有关详细信息，请参阅 <code>tiktoken/registry.py</code> 。 你的setup.py 应该是这样的： 
<pre><code class="prism language-python">from setuptools import setup, find_namespace_packagessetup(name="my_tiktoken_extension",packages=find_namespace_packages(include=['tiktoken_ext*']),install_requires=["tiktoken"],...
)
</code></pre> 
然后简单地执行 <code>pip install ./my_tiktoken_extension</code>，您应该能够使用自定义编码！请确保不要使用可编辑安装。 
<hr> 
2023-03-31(五)
</div>
<link href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/markdown_views-98b95bb57c.css" rel="stylesheet">
<link href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/style-c216769e99.css" rel="stylesheet">
</div>
<div id="treeSkill"></div></article>

作者

sockstack

许可协议

CC BY 4.0

发布于

2024-02-27

修改于

2025-08-16

上一篇：软件：常用 Linux 软件汇总，值得收藏下一篇：大规模语言模型微调技术——Instruction和Question的区别和联系

尚未登录

OpenAI - tiktoken ⏳ | fast BPE tokeniser

文章分类

博客重构之路

Spring Boot简单入门

k8s 入门教程

MySQL 知识

NSQ 消息队列

ThinkPHP5 源码分析

使用 Docker 从零开始搭建私人代码仓库

日常开发汇总

标签列表

springboot

hyperf

swoole

webman

php

多线程

数据结构

docker

k8s

thinkphp

mysql

tailwindcss

flowbite

css

前端